The massive data volumes that are common in a typical Hadoop deployment make compression a necessity. Compression saves a great deal of storage space, and it also accelerates the movement of data throughout the cluster. It's no surprise, then, that numerous compression schemes, called codecs, are available for us to consider.
In a Hadoop deployment we are potentially dealing with a large number of individual slave nodes, each of which has a number of large disk drives. It is not unusual for a single slave node to have more than 45 terabytes of raw storage space available for HDFS. Although Hadoop slave nodes are engineered to be inexpensive, they are not free, and with data volumes that tend to grow at increasing rates, compression is an essential tool for keeping storage under control.
Let’s learn some basic compression terminology:
Codecs:
A codec, which is a shortened form of compressor/decompressor, is technology (software or hardware, or both) for compressing and decompressing data; it’s basically an implementation of a compression/decompression algorithm. We need to know that some codecs support something known as splittable compression, and that codecs vary both in the speed with which they can compress and decompress data and in the degree to which they can compress it.
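As a minimal sketch of how a codec is used programmatically (the input path is just an illustration), the following Java snippet asks Hadoop's CompressionCodecFactory to infer a codec from a file's extension and then streams the decompressed contents to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;

public class CodecDemo {
    public static void main(String[] args) throws Exception {
        // Path to a compressed file on HDFS; name and location are illustrative.
        Path input = new Path(args[0]);          // e.g. /data/events.log.gz

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The factory infers the codec from the file extension (.gz, .bz2, ...).
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(input);
        if (codec == null) {
            System.err.println("No codec found for " + input);
            return;
        }

        // Wrap the raw HDFS stream with the codec's decompressing stream
        // and copy the decompressed bytes to standard output.
        try (InputStream in = codec.createInputStream(fs.open(input))) {
            IOUtils.copyBytes(in, System.out, conf, false);
        }
    }
}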
Splittable compression:
Splittable compression is one of the key concepts in a Hadoop context. The way Hadoop works is that files are broken down and split if they are larger than the file’s block size setting, and individual file splits can be processed in parallel by different mappers. With most codecs, however, file splits cannot be decompressed independently of the other splits from the same file, so those codecs are said to be nonsplittable, and MapReduce processing is limited to a single mapper. Because the file can be decompressed only as a whole, and not as individual parts based on splits, there can be no parallel processing of such a file, and performance can take a huge hit as a job waits for a single mapper to process multiple data blocks that can’t be decompressed independently.
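One way to avoid this trap is to write output in a splittable format in the first place. The sketch below is a hedged illustration, not a canonical recipe: a minimal pass-through MapReduce driver (the identity mapper/reducer and the class name are placeholders) that compresses its text output with BZip2, one of the few codecs whose output remains splittable for downstream jobs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pass-through with bzip2 output");
        job.setJarByClass(SplittableOutputJob.class);

        // Identity mapper and reducer: the job simply copies its text input.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the output with BZip2, whose compressed output can later be
        // split, so a downstream job gets one mapper per split instead of one
        // mapper for the whole compressed file.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}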
Splittable compression is only a factor when we are dealing with text files. For binary files, Hadoop compression codecs compress data within a binary-encoded container, depending on the file type (for example, a SequenceFile, Avro, or Protocol Buffer).
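As an example of the container approach, here is a minimal sketch (the output path and record contents are made up) of writing a block-compressed SequenceFile: the codec operates on batches of records inside the container, and the SequenceFile’s sync markers keep the file splittable regardless of which codec is used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedSequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/demo.seq");  // illustrative output path

        // DefaultCodec (deflate) is always available; Snappy or another codec
        // could be substituted if its native libraries are installed.
        CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);

        // BLOCK compression compresses groups of records inside the container.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK, codec))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}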